Cost-sensitive Web-based Information Acquisition for Record Matching
Abstract
In many record matching problems, the input data is ambiguous or incomplete, which makes the matching task difficult. For some domains, however, evidence for record matching decisions is readily available in large quantities on the Web and can be retrieved by querying a search engine. On the other hand, Web resources are slow to acquire compared with data already present in the input, and some Web resources must be acquired before others. Hence, Web resources must be acquired selectively and judiciously, while satisfying the acquisition dependencies between them. This thesis has two major goals: (1) to establish that acquiring Web-based resources can improve performance on record matching tasks, and (2) to propose an algorithm for selective acquisition of Web-based resources for record matching tasks that balances acquisition costs against acquisition benefits while taking acquisition dependencies between resources into account. The thesis has two major parts corresponding to these goals. In the first part, I propose methods for using information from the Web for three different record matching problems: author name disambiguation, linkage of short forms to long forms, and web people search. I thereby establish that acquiring Web-based resources can improve record matching. In the second and larger part, I propose approaches for selective acquisition of Web-based resources for record matching tasks, with the aim of balancing acquisition costs and acquisition benefits. These approaches start from the more task-specific and move towards the more general and principled. I first propose a way of adaptively combining two methods for record matching, followed by a cost-sensitive attribute value acquisition algorithm for support vector machines.
This work culminates in a framework for cost-sensitive resource acquisition problems with hierarchical dependencies, which is the main contribution of this thesis. This graphical framework is versatile and can be applied to a large variety of problems. In the context of this framework, I propose an effective resource acquisition algorithm for record matching problems that takes particular characteristics of such problems into account. Finally, I propose two benefit functions for use in the framework, corresponding to two different evaluation measures.
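To make the idea of balancing acquisition costs and benefits under dependencies concrete, here is a minimal sketch. It is not the algorithm proposed in the thesis: the `Resource` fields, the benefit estimates, the fixed budget, and the greedy benefit-per-cost rule are all assumptions chosen for illustration. A resource becomes eligible only once its parent has been acquired, which mimics a hierarchical dependency such as "fetch the search results before downloading a result page".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Resource:
    name: str
    cost: float                   # hypothetical acquisition cost, e.g. seconds; must be > 0
    benefit: float                # hypothetical estimated gain in matching quality
    parent: Optional[str] = None  # resource that must be acquired first, if any


def greedy_acquire(resources: List[Resource], budget: float) -> List[str]:
    """Greedily acquire the eligible resource with the best benefit/cost ratio.

    A resource is eligible when it is not yet acquired, its parent (if any)
    has been acquired, it fits in the remaining budget, and its estimated
    benefit is positive.
    """
    acquired: List[str] = []
    done = set()
    remaining = budget
    while True:
        candidates = [
            r for r in resources
            if r.name not in done
            and (r.parent is None or r.parent in done)
            and r.cost <= remaining
            and r.benefit > 0
        ]
        if not candidates:
            break
        best = max(candidates, key=lambda r: r.benefit / r.cost)
        acquired.append(best.name)
        done.add(best.name)
        remaining -= best.cost
    return acquired


# Illustrative, made-up numbers.
resources = [
    Resource("search_results", cost=1.0, benefit=0.5),
    Resource("top_page", cost=3.0, benefit=2.0, parent="search_results"),
    Resource("homepage", cost=2.0, benefit=0.2),
]
plan = greedy_acquire(resources, budget=4.5)
print(plan)
```

A myopic rule like this can overlook a low-benefit parent that unlocks a valuable child, which is one reason a principled treatment of dependencies matters.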
Similar resources
A Framework for Hierarchical Cost-sensitive Web Resource Acquisition∗
Many record matching problems involve information that is insufficient or incomplete, and thus solutions that classify which pairs of records are matches often involve acquiring additional information at some cost. For example, web resources impose extra query or download time. As the amount of resources that can be acquired is large, solutions invariably acquire only a subset of the resources ...
A procedure for Web Service Selection Using WS-Policy Semantic Matching
In general, policy-based approaches play an important role in the management of web services, for instance in the choice of semantic web services and of quality of service (QoS) in particular. The present work illustrates a procedure for selecting among functionally similar web services based on WS-Policy semantic matching. In this study, the procedure of WS-Policy publi...
Centralized Clustering Method To Increase Accuracy In Ontology Matching Systems
Ontology is the main infrastructure of the Semantic Web, providing facilities for integration, searching and sharing of information on the web. The development of ontologies as the basis of the Semantic Web, and their heterogeneity, have led to the need for ontology matching. With the emergence of large-scale ontologies in real domains, ontology matching systems have faced problems such as memory con...
Towards Scalable Real-Time Entity Resolution using a Similarity-Aware Inverted Index Approach
Most research into entity resolution (also known as record linkage or data matching) has concentrated on the quality of the matching results. In this paper, we focus on matching time and scalability, with the aim to achieve large-scale real-time entity resolution. Traditional entity resolution techniques have assumed the matching of two static databases. In our networked and online world, howev...
Information Security of the Web-Based Systems of the Iran Public Libraries Foundation
Purpose: This paper aims to evaluate the security of the web-based information systems of the Iran Public Libraries Foundation (IPLF). Methodology: A survey method was used. The data collection tool was a questionnaire based on the ISO/IEC 27002 standard, with eleven indicators and 79 sub-criteria, which examines the security of web-based information systems of IP...
Publication date: 2011